MacWorld 1999 July

home *** CD-ROM | disk | FTP | other *** search

/ MacWorld 1999 July / Macworld (1999-07).dmg / Shareware World / Info / For Developers / Mops 3.4.sea / Mops source / PPC source / cg1 < prev next >

Wrap

Text File | 1999-01-14 | 37KB | 1,125 lines

marker m__cg1 \ ========================================= \ POWERPC CODE GENERATOR \ ========================================= \ NOTE: all the technical documentation is at the end of this file, \ after the endload. true constant assertions? \ make it false when everything really works?!? PPC? [IF] false constant debug? [ELSE] false constant debug? [THEN] false constant 64bit? \ =============== OPTIMIZATION FLAGS =============== (* it's useful to be able to turn these off, to find those recalcitrant bugs - and in the case of range checking, to be able to turn it off for speed-critical tested code. Or for floating multiply-add, someone might want to turn it off to restore strict IEEE FP semantics. NOTE: the default for all these boolean values is TRUE. *) true value hoist? \ can we move ops out of loops if \ they're invariant in the loop? true value hoist_fetches? \ can we move fetches to as early \ as possible? true value allow_match? \ true if we can eliminate an op if \ we find the result already in \ a register. Basically the same \ as CSE (common subexpression \ elimination). true value move_by_recompiling? \ can we try to recompile an earlier \ op to avoid a reg move? true value optimize_leaf_calls? \ do we use the fast calling sequence \ for leaf definitions? true value optimize_branches? \ do we simplify branches over other \ branches, branches to the next \ instruction, etc.? true value cascade? \ do we combine ops into single \ instructions where we can? true value range_check? \ do we range-check array accesses? true value multiply-add? \ do we cascade a floating multiply \ followed by an add into a \ floating multiply-add instruction \ (or one of its variants)? \ =============== FORWARD DEFNS ================ forward .G forward .F forward .C forward .GS forward .CS forward .AL forward .FAL forward .FR forward ZS \ =============== OTHER GLOBALS ============== 0 value backstop_CDP 0 value fetch_backstop 0 value basic_block_start 0 value loop_start 0 value max_called_#PL 0 value max_called_#FPL false value equalizing? false value eq_block_recompiling_move? bytestring eq_ranges bytestring const_data bytestring sv_const_data 0 constant CR0 4096 constant sys_SP_framesize 8 constant FPcell (* We don't move the stack pointers every time we push and pop something - rather, we keep track of the accumulated offset here and only adjust when we have to. *) 0 value STK_OFFSET 0 value FSTK_OFFSET \ ================= OD FIELD VALUES ================== \ Here we define various values for OD fields. We generally use the same \ names as for the 68k code generator, although the values aren't all \ the same (which doesn't matter anyway). \ Mode values: enum { mdGPR mdBD mdAbs mdLit mdPC mdFPn } PPC? [IF] hexx [ELSE] hex [THEN] \ Flag byte bits: \ 0 constant flExt \ Sign extend 1 constant fbExt \ 1 constant flFP \ Floating operation \ 2 constant fbFP (* 2 constant flLit \ Floating Literal 4 constant fbLit 3 constant flFCR \ FPU constant ROM reference 8 constant fbFCR *) \ Operation type values (opType): \ 0 means empty, 1 to 7 mean not empty but unknown for some reason. \ So far we've only defined 1-3. 1 constant otUnknown \ isn't empty, but we don't know anything about \ the contents 2 constant otUnkStored \ ditto, but we've stored it and so might be \ able to optimize a fetch from the same \ location, even tho we don't know its type 3 constant otUnkPulled \ ditto, pulled from memory part of stack 7 constant otUnknownCodes \ all codes less than or equal to this are \ some kind of empty/unknown 8 constant otMove \ reg move \ the following are also on the 68k, and in target compilation they \ dispatch to us from the 68k interpreter using these codes, so we \ can't change them! 12 constant otMUL 13 constant otDIV 14 constant otUDIV \ the following are PPC only: 10 constant otMULH \ multiply high 11 constant otUMULH \ multiply high 16 constant otAddc \ 17 constant otAddic 17 constant otAdde 18 constant otAddze 19 constant otAddme 1A constant otSubfc \ 1C constant otSubfic 1B constant otSubfe 1C constant otSubfze 1D constant otSubfme \ these following are also 68k codes, and can't change 21 constant otADD 22 constant otSUB 23 constant otAND 24 constant otOR 25 constant otXOR 26 constant otCMP 27 constant otUCMP \ unsigned compare - PPC only 28 constant otNEG 29 constant otNOT 2A constant otShift 2B constant otShift&mask \ PPC only 2C constant otTrap \ ditto 30 constant otPMend \ End of integer ops \ 3F constant otFPcmp \ Floating-point comparison. A special case. 40 constant otFPstart \ Start of regular floating-point ops. Note \ these are NOT in the same order as the \ integer ops. 40 constant otFMOVE 41 constant otFADD 42 constant otFMUL 43 constant otFMADD \ multiply-add - PPC only 44 constant otFNMSUB \ negative multiply-subtract - AltiVec only 48 constant otFPnoncom \ The following FP ops are non-commutative 48 constant otFSUB 49 constant otFDIV 4F constant otFPcmp \ FP comparisons 160 constant otFPstore 162 constant otFPfetch 54 constant otFPmon \ The following FP ops are monadic (* we don't bother including these in the dictionary, but these are the values: 54 constant otFABS 55 constant otFNEG 56 constant otFSIN 57 constant otFCOS 58 constant otFTAN 59 constant otFATAN 5A constant otFSQRT *) 5F constant otFPend \ End of FP ops 60 constant otStore \ Store 62 constant otFetch \ Direct fetch \ otFetch \ constant otAt \ Indirect fetch - ends up being treated \ the same 62 constant otDUP \ Stack shuffling 63 constant ot2DUP 64 constant otDROP 65 constant ot2DROP 66 constant otSWAP 67 constant otOVER (* again we don't bother including these in the dictionary, but these are the values: 68 constant otNIP 69 constant otTUCK 6A constant otROT 6B constant otDOWN 6C constant ot2SWAP For FP stack shuffling, we'll use the corresponding opcodes in the range 72 - 7B. We'll use opcodes in the range 80 - A0 for unique AltiVec ops like vector splat that have no scalar equivalents: 80 vector splat 81 vector splat immediate 90 vector select 91 vector permute *) 200 constant otVecOffset \ all vector ops have this bit set so we \ can recognize them quickly. \ Subtype byte values: \ For comparisons, the top 4 bits of the byte give the condition register field \ bit which we need to branch on for this condition: LT = 0, GT = 1, EQ = 2. \ The low bit (bit 7) is 1 if the condition we want has a field bit value of 1. \ Bit 6 is 1 for unsigned, and bit 5 is 1 for comparisons with zero. 20 constant cmpNE 21 constant cmpEQ 00 constant cmpGE 01 constant cmpLT 10 constant cmpLE 11 constant cmpGT 02 constant cmpHS 03 constant cmpLO 12 constant cmpLS 13 constant cmpHI 24 constant cmpZNE 25 constant cmpZEQ 04 constant cmpZGE 05 constant cmpZLT 14 constant cmpZLE 15 constant cmpZGT \ here's a table to map our 68k comparison codes to the above PPC ones: PPC? [IF] createx comparison_codes 0 cx, \ 0 0 cx, \ 1 cmpHI cx, \ 2 cmpLS cx, \ 3 cmpHS cx, \ 4 cmpLO cx, \ 5 cmpNE cx, \ 6 cmpEQ cx, \ 7 0 cx, \ 8 0 cx, \ 9 0 cx, \ A 0 cx, \ B cmpGE cx, \ C cmpLT cx, \ D cmpGT cx, \ E cmpLE cx, \ F decimalx [ELSE] create comparison_codes 0 c, \ 0 0 c, \ 1 cmpHI c, \ 2 cmpLS c, \ 3 cmpHS c, \ 4 cmpLO c, \ 5 cmpNE c, \ 6 cmpEQ c, \ 7 0 c, \ 8 0 c, \ 9 0 c, \ A 0 c, \ B cmpGE c, \ C cmpLT c, \ D cmpGT c, \ E cmpLE c, \ F decimal [THEN] 0 value OPERATION 0 value SUBOPERATION \ Some types of instruction need special treatment - e.g. AND etc. use the \ rA field for the destination. So we define some types: enum { noType loadStoreType arithType logicalType cmpType branchType shiftType vecType } \ ================= UTILITY WORDS ================== : MONADIC? \ ( opcode -- opcode b ) otNeg otNot within? IF true EXIT THEN otFPmon otFPend within? ; : MEM_REFERENCING? \ ( opcode -- b ) $ FF and SELECT[ otFetch ], [ otStore ]=> true DEFAULT=> drop false ]SELECT ; PPC? [IF] : dasm ; \ can't disassemble in native mode yet! : z ; [ELSE] forward dasm \ disassembles what we've done so far forward dcurr \ disassembles the current defn (even if not finished) forward Z \ ends ppc compilation and disassembles. [THEN] \ GetImmediateOp does the same when we're going to execute the operation \ now, returning the xt of the word to execute. : GETIMMEDIATEOP { opType subtype -- xt } opType SELECT[ otAdd ]=> ['] + [ otSub ]=> ['] - [ otAND ]=> ['] and [ otOR ]=> ['] or [ otXOR ]=> ['] xor [ otMUL ]=> ['] * [ otDIV ]=> ['] / [ otNEG ]=> ['] negate [ otNOT ]=> ['] not [ otCMP ]=> subtype CASE[ cmpNE ]=> ['] <> [ cmpEQ ]=> ['] = [ cmpGE ]=> ['] >= [ cmpLT ]=> ['] < [ cmpLE ]=> ['] <= [ cmpGT ]=> ['] > [ cmpZNE ]=> ['] 0<> [ cmpZEQ ]=> ['] 0= [ cmpZGE ]=> ['] 0>= [ cmpZLT ]=> ['] 0< [ cmpZLE ]=> ['] 0<= [ cmpZGT ]=> ['] 0> DEFAULT=> cr .h ." undef otCMP subtype in getImmediateOp" 1 die ]CASE [ otUCMP ]=> subtype CASE[ cmpHS ]=> ['] u>= [ cmpLO ]=> ['] u< [ cmpLS ]=> ['] u<= [ cmpHI ]=> ['] u> DEFAULT=> cr .h ." undef otUCMP subtype in getImmediateOp" 1 die ]CASE [ otShift ]=> subtype CASE[ 0 ]=> ['] << [ 1 ]=> ['] >> [ 3 ]=> ['] a>> DEFAULT=> cr .h ." undef otShift subtype in getImmediateOp" 1 die ]CASE DEFAULT=> cr .h ." undef op passed to getImmediateOp" 1 die ]SELECT ; PPC? not [IF] \ we put this in pnuc3 in target mode \ 16bits? ( n signed? -- n b ) \ returns true if n will fit in 16 bits (signed or unsigned as requested). : 16BITS? \ ( n signed? -- n b ) IF -32768 32767 within? ELSE dup 16 >> 0= THEN ; [THEN] : SIGNED? \ ( operation - b ) \ Returns true if this is a signed op - assumed to be the \ default. We just return false for the specific unsigned \ ops, and true for everything else. SELECT[ otUDIV ], [ otAND ], [ otOR ], [ otXOR ], [ otUCMP ]=> false DEFAULT=> drop true ]SELECT ; : REVERSE_COMPARISON \ called if we swap operands of a compare. Adjust \ subOperation appropriately. subOperation $ 24 and ?EXIT \ out if monadic or EQ or NE $ 10 xor> subOperation ; \ This must come before we redefine @ABS below! PPC? not [IF] from pasmMod import{ :PPC_code ;PPC_code disasm disasm_word disasm_xt disasm_rng disasm_cnt disasm_one set_disasm_call_range } compile: pasmMod \ While still in 68k-land, we need a PPC-style reloc! and @abs. The proper \ PPC versions will be compiled in pnuc3 and setup respectively. Note, \ any changes must of course be made in both places. : RELOC! { theAddr dest -- } theAddr addr>S&D $ ffffff and swap 24 << or dest ! ; : reloc, DP reloc! 4 ++> DP ; : relocCode, CDP reloc! 4 ++> CDP ; : displCode, CDP displ! 4 ++> CDP ; \ In the target compilation, we only have 4 segments. : @ABS { addr \ relocAddr seg# displ -- absAddr } addr @ -> relocAddr relocAddr $ ffffff and -> displ relocAddr 24 >> -> seg# seg# SELECT[ 8 ]=> code_start [ 9 ]=> data_start [ 10 ]=> seg_code_start [ 11 ]=> seg_data_start DEFAULT=> ." not a reloc addr" 1 die ]SELECT displ + ; : @abs6 @abs ; \ a variant name we can call later when @abs \ has been redefined in the PPC image [THEN] endload ======================= TECHNICAL DOCUMENTATION ======================= Here we describe all the nuts and bolts stuff relating to PPC code generation. We've collected together a whole lot of commentary from all over our source files -- it should be easier to find things if they're all in one place, and we can also cut and paste it into the manual. The general way to generate optimized code for an architecture with n registers is to analyse each basic block and generate a directed acyclic graph (DAG) whose nodes represent each value generated and whose edges represent uses of those values. Note it isn't a tree, since more than one edge can go into a node. Then optimization can be done, and common subexpressions (i.e. nodes) combined. Then, using graph coloring or whatever, the nodes are assigned to registers. This assumes there are less regs than nodes, so that we have to find all dependencies, and re-use a reg when its old value isn't needed any more. Here we can simplify things a little, and manage everything in one pass. On the PPC we have so many regs that for normal shortish basic blocks, we will generally be able to simply assign a different reg for each node. On the few occasions when we don't have a free reg, we can select one which has a value with no outstanding references - there's a slight chance we could have used this value in a subsequent op, but this is pretty well negligible. We can optimize on the fly: each time we get a new value, we can search the reg set for a match so we can use a combined node. This code is invoked as follows: Whenever Handlers is called to generate 68K code, if PPC? is true, the Handlers selector and opcode is pushed and PPCvec is executed. We will set PPCvec to point to PPC_compile which will use the selector to dispatch to the right routine. ============= MEMORY ARCHITECTURE ================ On the PPC, code and data is normally kept separate, with the code being read-only. The PEF format defines separate code and data sections, and at launch time these are placed in pointer-based blocks in memory. We ought to conform to this convention for installed apps, since it will give the best performance. In the development environment, our dictionary should be in a read-write block for obvious reasons. We'll define a separate data area, which will become the data section of an installed app. These two areas can be in handles, which will allow us to resize them on the fly if necessary. At installation time, we'll generate a PEF in which the dictionary is added to the existing code section, and the data area is added to the data section. Our initial nucleus will probably be generated exactly this way, either from the 68k or PPC version. Thus our initial nucleus is simply an application, which transforms itself into the Mops development environment by creating a handle for the new dictionary and data areas. Now, we could allow the new dictionary area to link itself to the old one by some cleverness, but since the dic would only be getting split at one place I don't think it would be worth the complexity. Better to just BlockMove the old dic area to the start of the new when the dev environment is being set up, and likewise for the data area. Colon uses a new header format incorporating two flag bytes in addition to what we use on the 68k, and also has to observe 4-byte alignment for the code. For this reason we generate a full PPC-style header (including 4-byte alignment) for colon definitions. When we compile a call to a colon defn we have to do PPC-style alignment even on the 68k. For other types of definition we don't have to bother, and the code should work on both platforms (since on the PPC FIND will look after the extra alignment). ============= REGISTER DEFINITIONS: ================ We need 3 regs for scratch in boilerplate code sequences. We'll use r0, which is a bit unusual anyway, and r11 and r12 (see below). r1 is the stack pointer, r2 is RTOC. We can't monkey with these. r11 and r12 are used in the calling sequence for external calls. Apple says they can be used as scratch at all other times, so we'll use them in boilerplate sequences. PowerMacForth does this too, and in the assembler they're aliased as rX and rY. For external calls, r3-r10 are used for parameters, and r3 for a returned result. They won't be saved over the calls. Of course for internal Mops calls we can do what we like. We can use these regs for general operands, and on an external call normalize the stack so that the required number of cells are stored in r3 on. At that stage we won't have any other cached stack cells, so we don't need the regs preserved anyway. This scenario gives us 8 regs for general operands, i.e. cached stack cells (r3-r10), which should be enough. If it turns out not to be enough we could grab a couple of the regs we've allocated for locals (see below). r13-31 are "non-volatile" - they're saved over external calls. For internal Mops calls we just need to save the locals, since other regs like the base address regs don't get changed. Now for the special regs we need. These all need to be saved over external calls, and so are in the non-volatile block. For addressing the dictionary, a difference to the 68k version is that we need to keep code and data separate (see above). In the development environment these will be in handle-based blocks, and in an installed app they'll be defined in the PEF which will make them end up in pointer- based blocks. Anyway, as long as we handle the addressability questions properly, it shouldn't matter where they are. Code references will be for branches, constants and other constant data like literal strings and class info. Constants can always be handled via literal instructions. Even if it needs more than 16 bits it can be generated in a reg with 2 instructions which will be faster than a memory reference. Branches will always be self-relative, and they have enough displacement bits to get us anywhere. For other constant data, however, it's extremely handy to have a base reg available, even if these references aren't performance-critical. Thus, if we still have modules (which is still up for grabs), we'll need 4 base regs - 2 to address code and 2 for data. Note that on entry, r2 (RTOC) always points to the start of the data area as defined in the PEF. But we can't use RTOC as a regular base reg, since in the dev environment our data area will be off in a separate handle. Other regs we need are RP (return stack pointer), the loop variable I, and the base address of the current object. Now we may as well use one of our "local" registers for I, since it will be very rare for us to need all of them. This will mean one less local in definitions that use I, but that's not a problem. It seems we should go for a separate FP stack, since the Scientific Library is now using this. This will also give better code on the PPC, since when we bring stack cells back from mem to regs, we'll always know which regs to move them to. The floating stack pointer probably should be in a register. So in all we'll need 7 special regs out of the non-volatile block. This leaves r19-r31 for locals, which means that if we limit the number of locals to 13, we can keep them all in registers. This looks reasonable. Some notes on register handling across internal Mops calls: Saving and restoring local regs can be a bit long-winded, so to save space we should normally do what we do on the 68k - that is, at the start of each defn, save whatever regs we need for locals, and restore them at the end. For EXIT, instead of doing everything inline as we did on the 68k, we'll do a branch to the semicolon. This is almost as fast (esp. as it's an unconditional branch), and saves space. We'll probably make an exception for leaf procs, since these get executed so frequently. What I'm currently planning, in the case where the leaf proc has named parms/locals, is to do the houskeeping in the calling routine instead of in the called leaf proc. This will give me the possibility of generating the parms straight in the required regs. Also I might be able to do the saving and restoring of the needed regs at the beginning and end of the calling routine (depending on what parms/locals that routine might need. This would get these housekeeping operations out of any inner loops. This alternative calling convention should certainly be faster, but will take a lot more space, so I won't do it all the time. The 2 flag bytes are organized as 4 nybbles: nybble 0: bit 0: 1 if it's a leaf defn bit 1: 1 if it alters the count register bit 2: spare bit 3: 1 if it alters the FP or vector regs. nybble 1: number of results in registers on exit nybble 2: number of named parameters nybble 3: number of named parms + locals (doing it this way is a bit more convenient, and the max number of parms+locals is only 11, so we have enough bits) If the definition does any floating point or AltiVec operations, or anything that alters the FP or vector registers, we need more flags, so we add an extra 32 bits. For alignment, we have to take at least 32, so we might as well make the most of it. byte 0: $ BB - this is basically just for the disassembler. byte 1: (if we end up with a vector stack, we might use bytes 0 and 1 as 4 nybbles as for floats.) bytes 2 and 3 are 4 nybbles: nybble 0: nybble 1: number of floating results in registers on exit nybble 2: number of named floating parameters nybble 3: number of named floating parms + locals =============== FINALIZATION OF DEFINITIONS ================= At semicolon time, there are a number of things we have to fix up in the definition we just compiled. Once we've compiled the prolog and epilog and added the const_data, if any, the final location of the code is known. We can then "finalize" the definition. This includes resolving any EXITs and LEAVEs and putting in the correct offsets for calls to other words (these couldn't be determined before the code's final location was determined). To handle these various things, we use some pseudo-instructions to stand in place of the final instructions we're going to put in those same locations at finalization time. To finalize, we look through the whole definition for these pseudo-instructions, and take the appropriate action. For the pseudo-instructions, we have to use opcodes that can't ever be used for real instructions. Now we'll never use the lmw and stmw instructions, since Keith D has warned me that they'll go away in future! This means that we can use their values as pseudo-opcodes, since we know that Mops will never generate them with their proper meaning. The values in question are primary opcodes 46 and 47, which means $B8xxxxxx to $BFxxxxxx. We also define our handler codes into this same range, since the two-byte handler code appears on an aligned boundary. This will prevent a handler code ever being mistaken for an instruction. Thus all our handler codes and pseudo-instructions identify themselves, which generally makes life easier. So far we've defined these: BAxx xxxx call to a Mops word (xx xxxx is code-relative offset) BBxx xxxx floating-point/AltiVec flag bytes. This code is basically for the disassembler, since it's aligned and 3 flag bytes are plenty. It shouldn't come up in finalization. BCxx all handler codes which have no boilerplate code (can't be EXECUTEd). Unlike on the 68k, xx is unsigned and not doubled (so we can have 256 codes if we need them). BDxx all other handler codes, except for colon defns. They can be EXECUTEd. BE00 handler code for a colon defn BE01 ditto, but means this is a forward defn. BE02 marks the start of :loc code BE03 marks the start of :mloc code BE04 handler code for a :ppc_proc defn BE05 handler code for :entry BE40 method (note BD40 is an inline method - so the 40 always marks a method) (we'll reserve BExx for any further options on colon definitions.) These next two can't ever appear inside a definition, so we give an error if they're encountered during finalization. BF01 handler code for SYSCALL and EXTERN BF0B handler code for LIBRARY BF0C 0000 case-sensitive name for :ENTRY follows BF02 0000 EXIT BF03 xxxx conditional EXIT (xxxx is cond. branch opcode) BF04 0000 LEAVE BF05 0000 LOOP BF06 0000 target of a forward defn. This marker is redundant, but makes the decompiler output look more sensible. BF08 xxxx unconditional branch. xxxx is offset (we only need 16 bits). BF09 xxxx ELSE - branch. xxxx is initially the offset back to the original conditional branch, in case we delete this branch and need to adjust. Once the branch is resolved, and we know it won't be deleted, xxxx becomes the branch offset as for other unconditional branches. We can tell which is which, since the first offset is negative and the second positive. BF0A replace with a literal load into r0 of the distance the code is moved. Used in generating the addr of a location within the current definition (since we don't know until the end how far it might be moved). ==================== EXTERNAL CALLS ====================== CALL_EXTERN handles an external call. This requires that we set things up as the PowerPC volume of IM says: 1. We have a pointer which is resolved by the CFM - this will be the address of a transition vector. This pointer will be in the data area (since it gets changed), and has a reloc addr pointing to it in the code area, which belongs to the SYSCALL or EXTERN word. We have to allow for new external calls to be asked for, then executed straight away, so we use a scheme where when we do an external call, we check whether the pointer has been resolved yet, and resolve it if it hasn't. We can easily tell, since we initialize each pointer to nilP, which is an illegal address. This test and the call to FindSymbol to resolve it, is in get_transition_vector which is called at the beginning of our external call sequence. We could save a couple of instructions by pre-resolving symbols that are already in the dictionary image, but it's not worth it - it's better to use just one scheme, and we do need to be able to resolve on demand, so that's the way we do it. The transition vector has 2 addresses - the addr for us to branch to, and the new RTOC value. The dest addr has to be loaded into the CTR or the LR for us to use it as a branch target. We use the CTR - see below for the reason for this. We want to load the dest addr as early as possible so that instruction fetching won't stall, so we do this part of the setup before we equalize the stack - during the equalization nothing needs the CTR anyway. We use r12 for the addr of the transition vector itself, as IM says. This also won't get messed with during equalization. We set up r12 and the CTR in get_transition_vector, as well as resolving the symbol as described above. Factoring as much as possible into get_transition_vector saves code space in the call sequence for external calls. So, as well as a bit of housekeeping, the main thing that CALL_EXTERN does is to compile a call to get_transition_vector. CALL_EXTERN then passes 1 as an "xt" to CALL_H. 1 can never be a real xt, since they must be even, so this tells call_h that this is an external call. CALL_H looks after everything from here on, including the stack equalization. 2. The first thing call_h does is pass 1 to EQUALIZE_FOR_CALL (in the equalization section). This gets the parameters into the right regs (and the parameter area, if necessary), as needed for external calls. IM envisages that setting the SP is already done by the prolog of the current routine, on behalf of all external calls that this routine makes. The parameter area is big enough for the call with the most parameters, and the others leave some unused space below the parm area (actually higher in memory). The parm area for each call must come immediately below the linkage area, so the callee can find it. But in Mops we have a separate data stack pointer, so we simply set up a linkage area for external calls using the system SP (gpr1) at startup time, and never change it after that. 3. call_h then calls COMPILE_EXTERN_CALL (in this file, below) to compile the actual call. To do this, we store our own RTOC into the linkage area (actually this is done once and for all at startup since we have a permanent frame for external calls), and load RTOC from the transition vector (still pointed to by r12). We then bctrl (branch and link to count register) to call the external code. (We could equally well have used the LR - see below.) Note: the standard sequence for cross-TOC calls in Metrowerks C is as follows. We do much the same, but in a different order - in particular we grab the dest addr and get it into the CTR as early as possible, before we normalize the stack etc., and we move the SP to allocate the parm and linkage areas on each call. We could equally well have used the LR instead of the CTR. MW have to use the CTR since they've done a bl to the out-of-line code, and have to preserve the LR. But the IBM manual recommends using the CTR for computed branches like this, to make life easier for debuggers etc, so that's what we'll do. inline: bl xxx lwz r2/TOC, $14(r1/SP) ... xxx lwz r12, <offs>(r2/TOC) / TOC entry is a pointer to transfer vector stw r2/TOC, $14(r1/SP) / Save RTOC lwz r0, (r12) / 1st entry in TV is destination addr lwz r2/TOC, $4(r12) / 2nd entry is new TOC addr - put in RTOC mtspr CTR, r0 / dest addr to CTR bctr / branch there =================== CLASS/OBJECT FORMATS ====================== =============== Object header ==================== Note if the obj is an ivar, it doesn't have a header if it's in a record, unless the ivar is indexed. Indexed ivars always have headers, no matter what, since the indexing code relies on it. PPC notes: we have to make some minor changes to the object header format for various reasons. 1. As objects live in the data area, we need a back pointer to the dic entry in the code area, so methods like .ID: work. 2. The class pointer is a 4-byte relocatable address, and we want it to be aligned. 3. The indexed length of the object now always occupies the 4 bytes at the end of the indexed descriptor, whereas on the 68k, although we allocated 4 bytes, we normally only took notice of the low 2 bytes. We want the 4 bytes preceding the obj's data to look like a negative indexed length if the object isn't indexed. So we end up with this layout: 4 bytes Back pointer to cfa of obj's dic entry (relocatable). Zero if no dic entry, or if no name field there. 4 bytes Class pointer (relocatable). 2 bytes Self relative offset to the class pointer. For simple objects (i.e. not embedded), this is -4. For embedded objects, it will be more negative. Note it will always be negative. 2 bytes Self relative offset to the indexed area. If not indexed, this will be 2, pointing to the next byte which is the first data byte of the object. This means, if we erroneously try to access the indexed area of a non-indexed object, we'll get sent to the non-indexed data area, and try to interpret the preceding 4 bytes as an "indexed length" (which is what precedes a valid indexed area). But these 4 bytes are these two offsets, which will always appear as a negative number. If we're doing range checking, this will always trap. We did a similar sneaky trick on the 68k, but the field lengths/positions were different. (object's data starts here) This format means that when multiply inheriting, we need 4 bytes separating each group of ivars ("embedded object"), not 2 as on the 68k. For indexed objects, the indexed area (after the ivars) is preceded by the indexed descriptor (xdesc) with the following format. This format is the same as on the 68k, except that it starts off-aligned, i.e. the #elements field is 4-byte aligned. 2 bytes Width of indexed elements (in bytes) 4 bytes Number of elements minus 1 (i.e. LIMIT-1). We can check this with a trap instruction. ============== Class dictionary entry ================ link/name/hndlr as for normal words - normal class hndlr is $BC1D, and for imported classes, $BC2D. Note that the offsets are defined in 2 places, which must agree! - here, and in pNuc4. 2 bytes flags (note - we're now aligned) (offs 2) 32 bytes links to 8-way hashed method chains (relative) (offs 34) 4 bytes link to ivar chain (relative) (offs 38) 2 bytes padding for alignment of the n-way, and to put the next fields in the same place as on the 68k, which simplifies (findM). (offs 40) 2 bytes non-indexed data length (offs 42) 2 bytes width of indexed elements, or zero if not indexed (offs 44) 2 bytes "xdispl offs" - the ivar offset where indexing starts (used by large_obj_arrays), or zero if none. (offs 46) 4(n+1) bytes n-way to superclasses (n relocatable addrs terminated by zero) Flag bits: $0001 "large" - indexed with > 64K elements. $0002 class is exported from a module $0004 class is general $00X0 if nonzero, an object of this class could go in a register: 3 gpr 4 fpr 5 vector $0n00 data must be aligned by 2**n. If zero, we use the normal default alignment. (So far we only use this for vectors, which must be 16-byte aligned.) $8000 class is META (we use this to terminate NW_IVsetup) Note: on the 68k, at the class cfa there was a call to BLD, the word which built an object. These 4 bytes also served as a unique marker identifying a class dic entry, since we did a JSR to BLD, and this always had the same bit pattern. We treated the class handler code the same as for a colon definition, and simply BSR'd to the cfa. On the PPC we'll really use class_h, so that a class won't look like a colon definition any more. This is a bit more logical and shouldn't cause any problems. In any case, we couldn't use a call to BLD as a unique marker, since calls are all self-relative on the PPC so that calls to any particular word will have a different bit pattern depending on where they are. class_h is $BC1D, and will give an error if we try to EXECUTE a class (as with all $BCxx codes). I doubt this will break any existing code. Note also we've moved the "flags" from after the indexed width item, up to be the first item. This is so the 4-byte items come out aligned without sundry padding. ============== ivar dictionary entry ================ \ note: this format is the same on the PPC and 68k, for once! 4 bytes hashed name 4 bytes link to prev ivar dic entry (self-relative addr) 4 bytes class pointer (relocatable) 2 bytes offset of this ivar's data from the base addr of the class 2 bytes number of elements if indexed, or zero if not 2 bytes flags Flag bits: $0001 ivar gets an object header $0002 this is a static ivar $0004 this is a public ivar $XXX0 this ivar is in a register (it's actually a temp object). Reg bits are as for class flags above, in the lo 4 bits, and the actual reg number in the next 5 bits. So far the hi 3 bits are free (zero). Note: although indexed objects can have 2^^32 elements, we are assuming that an ivar can't have more than 64K elements. This is because we are limiting the maximum ivar length of a class to 64K bytes, which is a stricter condition. Would anybody want a longer ivar than this?? ============== Method dictionary entry ================ 4 bytes hashed name 4 bytes link to prev method dic entry (self-relative addr) 2 bytes method flags 2 bytes 0 (alignment) 2 bytes $BE40 - "handler code" - not actually used as a handler, but marks this as a method for decompiler etc. Note that an inline method uses the code BD40, so I can just look for 40 in the low byte to see if it's a method. 2 bytes flag bytes as for colon defns, giving number of named parms etc. This is the method's cfa (xt) here. (method code follows) Method flag bits: $0001 private method (note other way round to ivars - we're using 1 for the unusual case) $0080 there's a callFirst and/or callLast method Note that the method code starts 6 bytes later than in the 68k version. ========================================================== *)